Prioritizing hypothesis tests for high throughput data

Authors

  • Sangjin Kim
  • Paul Schliekelman
Abstract

Motivation: The advent of high throughput data has led to a massive increase in the number of hypothesis tests conducted in many types of biological studies and a concomitant increase in the stringency of significance thresholds. Filtering methods, which use independent information to eliminate less promising tests and thus reduce multiple testing, have been widely and successfully applied. However, key questions remain about how best to apply them: When is filtering beneficial and when is it detrimental? How good does the independent information need to be for filtering to be effective? How should one choose the filter cutoff that separates tests that pass the filter from those that do not?

Results: We quantify the effect of the quality of the filter information, the filter cutoff and other factors on the effectiveness of the filter and show a number of results. If the filter has a high probability (e.g. 70%) of ranking true positive features highly (e.g. top 10%), then filtering can lead to a dramatic increase (e.g. 10-fold) in discovery probability when there is high redundancy in information between hypothesis tests. Filtering is less effective when there is low redundancy between hypothesis tests, and its benefit decreases rapidly as the quality of the filter information decreases. Furthermore, the outcome is highly dependent on the choice of filter cutoff. Choosing the cutoff without reference to the data will often lead to a large loss in discovery probability. However, naïve optimization of the cutoff using the data will lead to inflated type I error. We introduce a data-based method for choosing the cutoff that maintains control of the family-wise error rate via a correction factor to the significance threshold. Application of this approach offers as much as a several-fold advantage in discovery probability relative to no filtering, while maintaining type I error control. We also introduce a closely related method of P-value weighting that further improves performance.

Availability and Implementation: R code for calculating the correction factor is available at http://www.stat.uga.edu/people/faculty/paul-schliekelman

Contact: [email protected]

Supplementary Information: Supplementary data are available at Bioinformatics online.
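The basic filtering idea described above can be sketched in a few lines: rank features by an independent filter statistic, keep only the top fraction, and apply the multiplicity correction over the tests that pass. This is a minimal illustrative sketch, not the authors' method; their correction factor for a data-chosen cutoff is more involved, and the simulated `filter_stat` and `pvals` below are placeholders for real per-feature statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10000      # total number of hypothesis tests
alpha = 0.05   # target family-wise error rate

# Simulated inputs (illustrative only): the filter statistic must be
# independent of the test statistic under the null for this to be valid.
filter_stat = rng.normal(size=m)   # e.g. overall per-feature variance
pvals = rng.uniform(size=m)        # all-null p-values for this sketch

# Keep the top fraction q of features by the filter statistic,
# then Bonferroni-correct only over the tests that pass.
q = 0.10
cutoff = np.quantile(filter_stat, 1 - q)
passed = filter_stat >= cutoff
m_pass = int(passed.sum())

threshold = alpha / m_pass               # Bonferroni over passing tests only
discoveries = passed & (pvals < threshold)
print(m_pass, int(discoveries.sum()))
```

Because only `m_pass` tests are corrected for instead of all `m`, the per-test threshold is 1/q times larger, which is the source of the power gain when the filter ranks true positives highly. Note that choosing `q` itself by optimizing over the data would inflate type I error, which is exactly the problem the paper's correction factor addresses.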


Similar articles

Prioritizing Endocrine-Disruptor Screening Using ToxPi Visually translating the integration of ToxCast data

July 6, 2010. Impact Statement: This research paper presents ToxPi (Toxicological Priority Index), a new weight-of-evidence framework for profiling and prioritizing chemicals. It numerically integrates various knowledge sources about chemical-specific properties (biological and chemical). T...


Variance of the number of false discoveries

In high-throughput genomic work, a very large number d of hypotheses are tested based on n ≪ d data samples. The large number of tests necessitates an adjustment for false discoveries, in which a true null hypothesis was rejected. The expected number of false discoveries is easy to obtain. Dependencies among the hypothesis tests greatly affect the variance of the number of false discoveries. Assum...
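The "easy to obtain" expected count follows from linearity of expectation: if m0 true null hypotheses are each tested at per-test level α, the expected number of false discoveries is E[V] = m0·α, regardless of dependence between tests (dependence affects only the variance). A minimal illustration, with hypothetical values for m0 and α:

```python
# Expected number of false discoveries under per-test level alpha.
# Both numbers below are hypothetical, chosen only for illustration.
m0 = 9500      # number of true null hypotheses
alpha = 0.001  # per-test significance level

expected_fd = m0 * alpha   # linearity of expectation: E[V] = m0 * alpha
print(expected_fd)         # approximately 9.5
```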


Testing for Stochastic Non-Linearity in the Rational Expectations Permanent Income Hypothesis

The Rational Expectations Permanent Income Hypothesis implies that consumption follows a martingale. However, most empirical tests have rejected the hypothesis. Those empirical tests are based on linear models. If the data generating process is non-linear, conventional tests may not assess some of the randomness properly. As a result, inference based on conventional tests of linear models can b...


Prioritizing Environmental Chemicals for Obesity and Diabetes Outcomes Research: A Screening Approach Using ToxCast™ High-Throughput Data

BACKGROUND Diabetes and obesity are major threats to public health in the United States and abroad. Understanding the role that chemicals in our environment play in the development of these conditions is an emerging issue in environmental health, although identifying and prioritizing chemicals for testing beyond those already implicated in the literature is challenging. This review is intended ...


Aqueous Solubility Prediction Based on Weighted Atom Type Counts and Solvent Accessible Surface Areas

In this work, four reliable aqueous solubility models, ASM-ATC (aqueous solubility model based on atom type counts), ASM-ATC-LOGP (aqueous solubility model based on atom type counts and ClogP as an additional descriptor), ASM-SAS (aqueous solubility model based on solvent accessible surface areas), and ASM-SAS-LOGP (aqueous solubility model based on solvent accessible surface areas and ClogP as...



Journal:
  • Bioinformatics

Volume 32, Issue 6

Pages: -

Publication date: 2016